Content-addressable Storage

Content-addressable storage (CAS), also referred to as content-addressed storage or fixed-content storage, is a way to store information so that it can be retrieved based on its content, not its name or location. It has been used for high-speed storage and retrieval of fixed content, such as documents stored for compliance with government regulations. Content-addressable storage is similar to content-addressable memory.

CAS systems work by passing the content of the file through a cryptographic hash function to generate a key that is, for practical purposes, unique: the "content address". The file system's directory stores these addresses and a pointer to the physical storage of the content. Because an attempt to store the same file will generate the same key, CAS systems ensure that the files within them are unique, and because changing the file will result in a new key, CAS systems provide assurance that the file is unchanged.

CAS became a significant market during the 2000s, especially after the introduction of the 2002 Sarbanes–Oxley Act, which required the storage of enormous numbers of documents that had to be kept for long periods and were retrieved only rarely. The ever-increasing performance of traditional file systems and new software systems has eroded the value of legacy CAS systems, which have become increasingly rare since roughly 2018. However, the principles of content addressability continue to be of great interest to computer scientists and form the core of numerous emerging technologies, such as peer-to-peer file sharing, cryptocurrencies, and distributed computing.
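The hashing step at the heart of CAS can be illustrated with a minimal sketch. It assumes SHA-256 as the cryptographic hash function; the article does not name a particular algorithm, and the function name `content_address` is hypothetical.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return the hex SHA-256 digest of the data, used here as its content address.

    SHA-256 is an assumption made for this sketch; any cryptographic
    hash function could play the same role.
    """
    return hashlib.sha256(data).hexdigest()


original = b"Quarterly report, final version."
duplicate = b"Quarterly report, final version."
edited = b"Quarterly report, final version. "  # one trailing space added

# Identical content always yields the same address.
print(content_address(original) == content_address(duplicate))  # True
# Any change to the content yields a different address.
print(content_address(original) == content_address(edited))     # False
```

The address depends only on the bytes of the content, so neither the filename nor the storage location plays any part in it.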


Description


Location-based approaches

Traditional file systems generally track files based on their filename. On random-access media like a floppy disk, this is accomplished using a directory that consists of some sort of list of filenames and pointers to the data. The pointers refer to a physical location on the disk, normally using disk sectors. On more modern systems and larger formats like hard drives, the directory is itself split into many subdirectories, each tracking a subset of the overall collection of files. Subdirectories are themselves represented as files in a parent directory, producing a hierarchy or tree-like organization. The series of directories leading to a particular file is known as a "path".

In the context of CAS, these traditional approaches are referred to as "location-addressed", as each file is represented by a list of one or more locations, the path and filename, on the physical storage. In these systems, the same file with two different names will be stored as two files on disk and thus have two addresses. The same is true if the same file, even with the same name, is stored in more than one location in the directory hierarchy. This makes them less than ideal for a digital archive, where any unique information should only be stored once.

As the concept of the hierarchical directory became more common in operating systems, especially during the late 1980s, this sort of access pattern began to be used by entirely unrelated systems. For instance, the World Wide Web uses a similar pathname/filename-like system known as the URL to point to documents. The same document on another web server has a different URL in spite of being identical content. Likewise, if an existing location changes in any way, for example if the filename changes or the server moves to a new domain name, the document is no longer accessible at its old address. This leads to the common problem of link rot.
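The two weaknesses of location addressing described in this section, duplicated content and broken references after a move, can be shown with a toy model. This is only an illustrative sketch: a Python dictionary stands in for the directory, and the paths and the `rename` helper are hypothetical.

```python
# A toy location-addressed store: the key is the path, the value is the content.
files = {
    "/archive/2003/report.txt": b"annual report",
    "/home/alice/report.txt": b"annual report",  # identical bytes, stored again
}


def rename(store: dict, old_path: str, new_path: str) -> None:
    """Move a file to a new location; the old location stops resolving."""
    store[new_path] = store.pop(old_path)


stored_copies = len(files)                    # two entries on "disk"
distinct_contents = len(set(files.values()))  # but only one distinct document

rename(files, "/home/alice/report.txt", "/home/alice/old/report.txt")

# Anyone still holding the old path now has a dead reference (link rot).
still_reachable = "/home/alice/report.txt" in files
print(stored_copies, distinct_contents, still_reachable)  # 2 1 False
```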


CAS and FCS

Although location-based storage is widely used in many fields, this was not always the case. Previously, the most common way to retrieve data from a large collection was to use some sort of identifier based on the content of the document. For instance, the ISBN system is used to generate a unique number for every book. If one performs a web search for "ISBN 0465048994", one will be provided with a list of locations for the book Why Information Grows, on the topic of information storage. Although many locations will be returned, they all refer to the same work, and the user can then pick whichever location is most appropriate. Additionally, if any one of these locations changes or disappears, the content can be found at any of the other locations.

CAS systems attempt to produce ISBN-like results automatically and on any document. They do this by using a cryptographic hash function on the data of the document to produce what is sometimes known as a "key" or "fingerprint". This key is strongly tied to the exact content of the document; adding a single space at the end of the file, for instance, will produce a different key. In a CAS system, the directory does not map filenames onto locations, but uses the keys instead.

This provides several benefits. For one, when a file is sent to the CAS for storage, the hash function produces a key, and the system then checks whether that key already exists in the directory. If it does, the file is not stored, as the one already in storage is identical. This allows CAS systems to easily avoid duplicate data. Additionally, as the key is based on the content of the file, retrieving a document with a given key ensures that the stored file has not been changed. The downside to this approach is that any change to the document produces a different key, which makes CAS systems unsuitable for files that are often edited. For all of these reasons, CAS systems are normally used for archives of largely static documents, and are sometimes known as "fixed content storage" (FCS).

Because the keys are not human-readable, CAS systems implement a second type of directory that stores metadata that will help users find a document. These almost always include a filename, allowing the classic name-based retrieval to be used. But the directory will also include fields for common identification systems like ISBN or ISSN codes, user-provided keywords, time and date stamps, and full-text search indexes. Users can search these directories and retrieve a key, which can then be used to retrieve the actual document.

Using a CAS is very similar to using a web search engine. The primary difference is that a web search is generally performed on a topic basis using an internal algorithm that finds "related" content and then produces a list of locations. The results may be a list of the identical content in multiple locations. In a CAS, more than one document may be returned for a given search, but each of those documents will be unique and presented only once.

Another advantage of CAS is that the physical location in storage is not part of the lookup system. If, for instance, a library's card catalog stated that a book could be found on "shelf 43, bin 10", the entire catalog would have to be updated whenever the library was rearranged. In contrast, the ISBN will not change, and the book can be found by looking for the shelf with those numbers. In the computer setting, a file in the DOS filesystem at the path A:\myfiles\textfile.txt points to the physical storage of the file in the myfiles subdirectory. The file can no longer be found at that path if the floppy is moved to the B: drive, and even moving its location within the disk hierarchy requires the user-facing directories to be updated. In CAS, only the internal mapping from key to physical location changes, and this exists in only one place and can be designed for efficient updating. This allows files to be moved among storage devices, and even across media, without requiring any changes to the retrieval.

For data that changes frequently, CAS is not as efficient as location-based addressing. In these cases, the CAS device would need to continually recompute the address of data as it was changed. This would result in multiple copies of the entire, almost-identical document being stored, which is the very problem that CAS attempts to avoid. Additionally, the user-facing directories would have to be continually updated with these "new" files and would become polluted by many similar documents, making searching more difficult. In contrast, updating a file in a location-based system is highly optimized; only the internal list of sectors has to be changed, and many years of tuning have been applied to this operation.

Because CAS is used primarily for archiving, file deletion is often tightly controlled or even impossible under user control. In contrast, automatic deletion is a common feature, removing all files older than some legally defined requirement, say ten years.
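The store, deduplicate, and verify behaviour described in this section, together with the separate metadata directory, can be sketched as a small in-memory class. This is a simplified illustration, not the design of any particular product: SHA-256 is an assumed hash function, a dictionary stands in for physical storage, and `ContentStore`, `IntegrityError`, and the method names are all hypothetical.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when stored bytes no longer match their content address."""


class ContentStore:
    """A toy content-addressed store with a filename metadata directory."""

    def __init__(self):
        self._blobs = {}     # content address -> bytes (stand-in for physical storage)
        self._metadata = {}  # content address -> set of human-readable names

    @staticmethod
    def address_of(data: bytes) -> str:
        # SHA-256 is an assumption; the article does not name an algorithm.
        return hashlib.sha256(data).hexdigest()

    def put(self, data: bytes, name: str) -> str:
        key = self.address_of(data)
        if key not in self._blobs:  # identical content is stored only once
            self._blobs[key] = data
        self._metadata.setdefault(key, set()).add(name)
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        if self.address_of(data) != key:  # recompute the hash to confirm integrity
            raise IntegrityError(key)
        return data

    def find(self, name: str) -> list:
        """Look up content addresses by filename in the metadata directory."""
        return sorted(k for k, names in self._metadata.items() if name in names)

    def blob_count(self) -> int:
        return len(self._blobs)


store = ContentStore()
k1 = store.put(b"annual report 2003", "report.txt")
k2 = store.put(b"annual report 2003", "copy-of-report.txt")  # same bytes, new name
k3 = store.put(b"annual report 2003 ", "edited.txt")         # one extra space

print(k1 == k2, k1 == k3, store.blob_count())  # True False 2
```

In this sketch, the two names for the identical document share one stored blob, while the edited version receives its own key. Moving a blob to different physical storage would only require updating the internal `_blobs` mapping; the keys held by users would stay the same.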


In distributed computing

The simplest way to implement a CAS system is to store all of the files in a typical database to which clients connect to add, query, and retrieve files. However, the unique properties of content addressability mean that the paradigm is well suited to computer systems in which multiple hosts collaboratively manage files with no central authority, such as distributed file sharing systems. In such systems, the physical location of a hosted file can change rapidly in response to changes in network topology, while the exact content of the files to be retrieved is of more importance to users than their current physical location. In a distributed system, content hashes are often used for quick network-wide searches for specific files, or to quickly see which data in a given file has changed and must be propagated to other members of the network with minimal bandwidth usage. In these systems, content addressability allows a highly variable network topology to be abstracted away from users who wish to access data, in contrast to systems like the World Wide Web, in which a consistent location of a file or service is key to easy use.
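One way content hashes can reveal which parts of a file have changed is to hash the file in pieces and compare the resulting lists. The sketch below assumes fixed-size chunks and SHA-256; both are choices made for illustration, and the names `chunk_hashes` and `changed_chunks` are hypothetical. Real systems differ in how they split data.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Hash each fixed-size chunk of the data and return the list of digests."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: bytes, new: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Return indices of chunks in `new` whose hash differs from, or is absent in, `old`."""
    old_h = chunk_hashes(old, chunk_size)
    new_h = chunk_hashes(new, chunk_size)
    return [i for i, h in enumerate(new_h) if i >= len(old_h) or old_h[i] != h]


old = b"aaaabbbbcccc"
new = b"aaaaBBBBccccdddd"

# Only the modified chunk (index 1) and the appended chunk (index 3)
# would need to be sent to other members of the network.
print(changed_chunks(old, new))  # [1, 3]
```

Two peers can exchange the short hash lists first and then transfer only the chunks that differ. A limitation of fixed-size chunking is that inserting bytes near the start of a file shifts every later chunk, so all of them appear changed.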


History

A hardware device called the Content Addressable File Store (CAFS) was developed by International Computers Limited (ICL) in the late 1960s and put into use by British Telecom in the early 1970s for telephone directory lookups. The user-accessible search functionality was maintained by the disk controller with a high-level application programming interface (API), so users could send queries into what appeared to be a black box that returned documents. The advantage was that no information had to be exchanged with the host computer while the disk performed the search.

Paul Carpentier and Jan van Riel coined the term CAS while working at a company called FilePool in the late 1990s. FilePool was purchased by EMC Corporation in 2001, and its technology was released the next year as Centera (Mark Ferelli, "Content-addressable storage – Storage as I See it", BNET.com, October 2002).
The timing was perfect; the introduction of the Sarbanes–Oxley Act in 2002 required companies to store huge amounts of documentation for extended periods, and to do so in a fashion that ensured the documents were not edited after the fact (USENIX Annual Technical Conference 2003, General Track, abstract). A number of similar products soon appeared from other large-system vendors. In mid-2004, the industry group SNIA began working with a number of CAS providers to create standard behavior and interoperability guidelines for CAS systems (CAS industry standardization activities, XAM: http://www.snia.org/forums/xam).

In addition to dedicated CAS systems, a number of products emerged that added CAS-like capabilities to existing products; notable among these was IBM Tivoli Storage Manager. The rise of cloud computing and the associated elastic cloud storage systems like Amazon S3 further diluted the value of dedicated CAS systems.
Dell purchased EMC in 2016 and stopped sales of the original Centera in 2018 in favor of its elastic storage product.

CAS was not associated with peer-to-peer applications until the 2000s, when rapidly proliferating Internet access in homes and businesses led to a large number of computer users who wanted to swap files, originally doing so on centrally managed services like Napster. However, an injunction against Napster prompted the independent development of file-sharing services such as BitTorrent, which could not be centrally shut down. In order to function without a central federating server, these services rely heavily on CAS to enforce the faithful copying and easy querying of unique files. At the same time, the growth of the open-source software movement in the 2000s led to the rapid proliferation of CAS-based services such as Git, a version control system that uses cryptographic constructs such as Merkle trees to enforce data integrity between users and allow for multiple versions of files with minimal disk and network usage. Around this time, individual users of public-key cryptography used CAS to store their public keys on systems such as key servers.

The rise of mobile computing and high-capacity mobile broadband networks in the 2010s, coupled with increasing reliance on web applications for everyday computing tasks, strained the existing location-addressed client–server model commonplace among Internet services, leading to an accelerated pace of link rot and an increased reliance on centralized cloud hosting. Furthermore, growing concerns about the centralization of computing power in the hands of large technology companies, potential abuses of monopoly power, and privacy led to a number of projects with the goal of creating more decentralized systems.
Bitcoin uses CAS and public/private key pairs to manage wallet addresses, as do most other cryptocurrencies. IPFS uses CAS to identify and address communally hosted files on its network. Numerous other peer-to-peer systems designed to run on smartphones, which often access the Internet from varying locations, use CAS to store and access user data for both convenience and data privacy purposes, such as secure instant messaging.


Implementations


Proprietary

The Centera CAS system consists of a series of networked nodes (typically large servers running Linux), divided between storage nodes and access nodes. The access nodes maintain a synchronized directory of content addresses and the corresponding storage node where each address can be found. When a new data element, or blob, is added, the device calculates a hash of the content and returns this hash as the blob's content address. (Chris Mellor, "Making a hash of file content: Content-addressable storage uses hash algorithms", Techworld, 9 December 2003; article moved to https://www.techworld.com/data/making-a-hash-of-file-content-235/)

As mentioned above, the directory is searched for this hash to verify that identical content is not already present. If the content already exists, the device does not need to perform any additional steps; the content address already points to the proper content. Otherwise, the data is passed to a storage node and written to the physical media. When a content address is provided to the device, it first queries the directory for the physical location of the specified content address. The information is then retrieved from a storage node, and the hash of the data is recomputed and verified. Once this is complete, the device can supply the requested data to the client.

Within the Centera system, each content address actually represents a number of distinct data blobs, as well as optional metadata. Whenever a client adds an additional blob to an existing content block, the system recomputes the content address. For additional data security, whenever no read or write operation is in progress, the Centera access nodes continuously communicate with the storage nodes, checking that at least two copies of each blob are present and intact. They can also be configured to exchange data with a different (e.g., off-site) Centera system, strengthening the precautions against accidental data loss.

IBM offers another flavor of CAS, which can be software-based (Tivoli Storage Manager 5.3) or hardware-based (the IBM DR550). The architecture differs in that it is based on a hierarchical storage management (HSM) design, which provides some additional flexibility: it supports not only WORM (write once, read many) disk but also WORM tape, as well as the migration of data from WORM disk to WORM tape and vice versa. This helps in disaster recovery situations and can reduce storage costs by moving data off disk to tape.

Another typical implementation is iCAS from iTernity. The concept of iCAS is based on containers, each addressed by its hash value. Each container can hold a different number of fixed-content documents. A container cannot be changed, and its hash value is fixed after the write process.
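The store-and-retrieve flow described above for Centera (hash the content, consult the directory, write only if the address is new, and re-verify the hash on read) can be sketched roughly as follows. This is a minimal in-memory illustration, not Centera's implementation: the class and method names are hypothetical, a Python dictionary stands in for the access-node directory and the storage nodes, and SHA-256 is an assumed choice of hash function.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when stored bytes no longer match their content address."""


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, illustrative only)."""

    def __init__(self):
        # Directory mapping content address -> stored bytes. A real system
        # would map the address to a location on a storage node instead.
        self._directory = {}

    @staticmethod
    def address_of(data: bytes) -> str:
        # Assumption: SHA-256 as the hash function.
        return hashlib.sha256(data).hexdigest()

    def put(self, data: bytes) -> str:
        address = self.address_of(data)
        # If identical content is already present, nothing more is written.
        if address not in self._directory:
            self._directory[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._directory[address]  # raises KeyError for unknown addresses
        # Recompute the hash on read and compare it with the requested address.
        if self.address_of(data) != address:
            raise IntegrityError(address)
        return data
```

Storing the same bytes twice returns the same address and keeps a single copy, while any change to the stored bytes is detected on the next read because the recomputed hash no longer matches the address.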


Open-source

One of the first content-addressed storage servers, Venti, was originally developed for Plan 9 from Bell Labs and is now also available for Unix-like systems as part of Plan 9 from User Space. The first step towards an open-source CAS+ implementation is Twisted Storage. Tahoe Least-Authority File Store is an open-source implementation of CAS.

Git is a userspace CAS filesystem, primarily used as a source code control system. git-annex is a distributed file synchronization system that uses content-addressable storage for the files it manages. It relies on Git and symbolic links to index their filesystem location.

Project Honeycomb is an open-source API for CAS systems. The XAM interface was developed under the auspices of the Storage Networking Industry Association. It provides a standard interface for archiving CAS (and CAS-like) products and projects.

Perkeep is a recent project to bring the advantages of content-addressable storage "to the masses". It is intended for a wide variety of use cases, including distributed backup, a snapshotted-by-default, version-controlled filesystem, and decentralized, permission-controlled filesharing.

Irmin is an OCaml "library for persistent stores with built-in snapshot, branching and reverting mechanisms"; it follows the same design principles as Git. Cassette is an open-source CAS implementation for C#/.NET.

Arvados Keep is an open-source content-addressable distributed storage system. It is designed for large-scale, computationally intensive data science work such as storing and processing genomic data. Infinit is a content-addressable and decentralized (peer-to-peer) storage platform that was acquired by Docker Inc.
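As a small illustration of the content addressing used by Git, mentioned above, the sketch below computes the object ID Git assigns to a file's contents (a "blob"). It assumes Git's default SHA-1 object format, in which the ID is the SHA-1 of a short header (`blob <length>` followed by a NUL byte) plus the raw content; repositories created with the SHA-256 object format would produce different IDs. The function name `git_blob_id` is hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to `content` as a blob."""
    # Git hashes a header of the form b"blob <size>\0" followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty file always maps to the same well-known ID.
empty_id = git_blob_id(b"")
# Identical content yields an identical ID, regardless of file name or path.
same = git_blob_id(b"hello world\n") == git_blob_id(b"hello world\n")
```

The result should match what `git hash-object` reports for a file with the same bytes, which is why two files with identical content occupy a single object in a repository.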
The InterPlanetary File System (IPFS) is a content-addressable, peer-to-peer hypermedia distribution protocol.

casync is a Linux software utility by Lennart Poettering to distribute frequently updated file system images over the Internet.
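To show why content addressing suits the distribution of frequently updated images, here is a rough sketch of chunk-level synchronisation. It is not casync's actual algorithm or on-disk format: fixed-size chunks, SHA-256 addresses, plain dictionaries as chunk stores, and all function names are simplifying assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small so the demo is easy to follow


def chunk_addresses(image: bytes, store: dict) -> list:
    """Split `image` into fixed-size chunks, add each to `store`, return the index."""
    index = []
    for offset in range(0, len(image), CHUNK_SIZE):
        chunk = image[offset:offset + CHUNK_SIZE]
        address = hashlib.sha256(chunk).hexdigest()
        store.setdefault(address, chunk)  # identical chunks are stored once
        index.append(address)
    return index


def missing(index: list, local: dict) -> list:
    """Addresses from `index` that the local store lacks, each listed once."""
    return [a for a in dict.fromkeys(index) if a not in local]


def reassemble(index: list, store: dict) -> bytes:
    """Rebuild an image by concatenating its chunks in index order."""
    return b"".join(store[a] for a in index)


# Server publishes two versions of an image that differ in one chunk.
server = {}
old_index = chunk_addresses(b"AAAABBBBCCCC", server)
new_index = chunk_addresses(b"AAAAXXXXCCCC", server)

# A client that already holds the old version fetches only what it lacks.
client = {a: server[a] for a in old_index}
to_fetch = missing(new_index, client)
for address in to_fetch:
    client[address] = server[address]
new_image = reassemble(new_index, client)
```

Because unchanged chunks keep their addresses across versions, the client in this sketch transfers a single chunk rather than the whole image.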


See also

* Content Addressable File Store
* Content-centric networking / Named data networking
* Data Defined Storage
* Write Once Read Many


External links


* Fast, Inexpensive Content-Addressed Storage in Foundation
* Venti: a new approach to archival storage